Between February 2021 and April 2022, a Penumbra-sponsored study collected medical data, ranging from social history to genetics, from 400 stroke patients. The goal of the study was to discover medically relevant details about patients that could further the understanding of stroke severity and treatment. This analysis explored over 10 GB of data and found evidence that helps explain the variability in stroke severity among male and female patients.
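Stroke is a leading cause of long-term disability in the United States. According to the CDC, more than 795,000 Americans suffer a stroke each year. A stroke is a medical emergency where either the blood supply to part of the brain is blocked (an ischemic stroke) or a blood vessel in the brain bursts (a hemorrhagic stroke).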
Both events can damage the brain and cause long-term disability or death. The severity of a stroke is often measured using the National Institutes of Health Stroke Scale (NIHSS). The final score (ranging from 0 to 42) is derived using 15 neurological examination questions and the stroke severity can roughly be interpreted using the following bands:
Between February 2021 and April 2022, various medical details were collected from 400 stroke patients as part of a study sponsored by Penumbra, a company that develops products to help treat stroke patients. The data is broken into the following three categories:
The goal of this analysis was to identify any information contained in the study that could advance our understanding of strokes with the hope of improving patient treatment. Accordingly, this analysis focused on creating explanatory models as opposed to a predictive model.
Since the objective of this analysis was to construct an explanatory model, exploratory data analysis (EDA) played a key role in assessing data quality, choosing a response variable, selecting candidate predictors, and identifying any data handling techniques needed.
The hospital data had several variables that could be used to assess the severity of a patient’s stroke, including the scores assigned using the Glasgow Coma Scale (GCSSCTOT), the National Institutes of Health Stroke Scale (NIHSSTOT), and the Modified Rankin Scale (MRSSCORE). Changes in these scores between admission and discharge could also be used to search for treatment effects. Choosing the response for this analysis came down to selecting the variable with the fewest missing values. At admission, GCSSCTOT was missing for 335 of the 400 patients, MRSSCORE was missing for 86, and NIHSSTOT was missing for only 4 (see the univariate statistics for NIHSSTOT). Missing values were prevalent in the discharge details as well. As a result, the patient’s total NIHSS score at admission (NIHSSTOT) was selected as the response variable for this analysis.
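The missing-value comparison described above can be sketched in a few lines of base R. The data frame `toy` and its contents are synthetic stand-ins (an assumption for illustration, not the study data); only the column names are taken from the text.

```r
# Synthetic stand-in for the admission scores; values and missingness are made up.
set.seed(1)
n <- 400
toy <- data.frame(
  GCSSCTOT = sample(3:15, n, replace = TRUE),
  MRSSCORE = sample(0:5, n, replace = TRUE),
  NIHSSTOT = sample(0:42, n, replace = TRUE)
)

# Blank out an illustrative number of entries per column.
toy$GCSSCTOT[sample(n, 335)] <- NA
toy$MRSSCORE[sample(n, 86)] <- NA
toy$NIHSSTOT[sample(n, 4)] <- NA

# Count missing values per candidate response and keep the most complete one.
na_counts <- colSums(is.na(toy))
best_response <- names(which.min(na_counts))

print(na_counts)
print(best_response)
```

The same `colSums(is.na(...))` call could be pointed at the discharge columns to compare their completeness.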
Univariate statistics for NIHSSTOT can be viewed by expanding the drop-down below. The first two plots show the distribution of the NIHSSTOT values and the bottom plot shows the cumulative distribution. The patients’ stroke severity scores are slightly positively skewed (skew of 0.24) and span nearly the full range of the scale (0 to 40 out of a possible 42), with about half of the patients having Mild to Moderate scores and half having Severe to Very Severe scores.
## ------------------------------------------------------------------------------
## patients$NIHSSTOT (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 396 4 37 2 15.13 14.37
## 99.0% 1.0% 0.5% 15.89
##
## .05 .10 .25 median .75 .90 .95
## 3.00 5.00 9.00 15.00 20.00 25.00 28.00
##
## range sd vcoef mad IQR skew kurt
## 40.00 7.68 0.51 8.90 11.00 0.24 -0.40
##
## lowest : 0.0 (2), 1.0 (11), 2.0 (3), 3.0 (6), 4.0 (10)
## highest: 32.0, 33.0 (3), 34.0, 37.0, 40.0
##
## ' 95%-CI (classic)
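The layout of the summary above resembles the output of a descriptive-statistics helper such as `Desc()` from the DescTools package; that is an assumption based on the formatting alone. As a rough sketch, the main quantities can be reproduced in base R. The `scores` vector is synthetic, and the skewness uses the plain moment formula, which may differ slightly from a packaged implementation's bias correction.

```r
# Synthetic NIHSS-like scores (hypothetical): 396 observed values plus 4 missing.
set.seed(42)
scores <- c(pmin(42, rpois(396, lambda = 15)), rep(NA, 4))

x <- scores[!is.na(scores)]
centered <- x - mean(x)

# Moment-based sample skewness: third central moment over variance^(3/2).
skew <- mean(centered^3) / mean(centered^2)^1.5

summary_stats <- c(
  n      = length(x),
  NAs    = sum(is.na(scores)),
  mean   = mean(x),
  median = median(x),
  sd     = sd(x),
  IQR    = IQR(x),
  skew   = skew
)
print(round(summary_stats, 2))

# Quantiles at the same probabilities reported in the summary table.
print(quantile(x, probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)))

# Empirical cumulative distribution: share of patients scoring 15 or lower.
share_at_or_below_15 <- ecdf(x)(15)
print(share_at_or_below_15)
```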
Since the goal of this analysis was to create an explanatory model, candidate predictors were selected by manually reviewing the 240+ available variables. Each was assessed on three criteria: whether enough data was present to merit an analysis, whether there were potential differences in centrality and spread, and whether the predictor could reasonably be included in the model based on previous stroke analyses or demographic areas of interest. In the end, the following 23 predictors were selected as candidates for an explanatory model.
The summary statistics for each candidate predictor are provided in the tabs below and can be viewed by selecting the predictor tab and clicking the univariate statistics drop down arrow. The relationship between each candidate predictor and the response variable (NIHSSTOT) can be reviewed by expanding the bivariate statistics drop down.
Glucose levels were measured in mg/dL. Notably, the data is positively skewed and has a slight positive correlation with stroke severity.
## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Glucose"
## ------------------------------------------------------------------------------
## patients$GLUC (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 394 6 139 0 140.61 134.70
## 98.5% 1.5% 0.0% 146.53
##
## .05 .10 .25 median .75 .90 .95
## 89.00 96.00 107.00 122.50 149.75 201.70 260.45
##
## range sd vcoef mad IQR skew kurt
## 420.00 59.74 0.42 28.91 42.75 2.69 9.02
##
## lowest : 69.0, 73.0, 74.0, 80.0, 81.0 (2)
## highest: 408.0, 422.0, 438.0, 450.0, 489.0
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ GLUC (patients)
##
## Summary:
## n pairs: 400, valid: 390 (97.5%), missings: 10 (2.5%)
##
##
## Pearson corr. : 0.125
## Spearman corr.: 0.133
## Kendall corr. : 0.092
The white blood cell count was recorded in K/uL or 10^3 cells/mm^3 for nearly every patient. The units are equivalent, and the standard range for white blood cell count is 4 - 11 K/uL. The data is positively skewed and has a slight positive correlation with stroke severity.
## [1] "Category: Baseline Laboratory Values"
## [1] "Description: White Blood Cells"
## ------------------------------------------------------------------------------
## patients$WBC (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 394 6 207 0 9.4129 8.9587
## 98.5% 1.5% 0.0% 9.8670
##
## .05 .10 .25 median .75 .90 .95
## 4.5755 5.0000 6.4300 8.8000 11.0000 14.3000 16.3050
##
## range sd vcoef mad IQR skew kurt
## 59.0600 4.5855 0.4871 3.4100 4.5700 4.2522 40.3834
##
## lowest : 2.04, 3.2, 3.26, 3.4, 3.8
## highest: 22.4, 23.74, 23.78, 28.1, 61.1
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ WBC (patients)
##
## Summary:
## n pairs: 400, valid: 390 (97.5%), missings: 10 (2.5%)
##
##
## Pearson corr. : 0.112
## Spearman corr.: 0.068
## Kendall corr. : 0.045
The red blood cell count was recorded (equivalently) in k/uL or 10^3 cells /mm^3 for nearly every patient. The data does not have any apparent outliers and is fairly symmetric in distribution. The red blood cell count, on its own, isn’t correlated with the stroke severity but was included since it helps describe the patient’s blood composition.
## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Red Blood Cells"
## ------------------------------------------------------------------------------
## patients$RBC (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 384 16 201 0 4.3665 4.2914
## 96.0% 4.0% 0.0% 4.4417
##
## .05 .10 .25 median .75 .90 .95
## 3.0500 3.5100 3.8900 4.3750 4.8200 5.2400 5.5585
##
## range sd vcoef mad IQR skew kurt
## 5.8000 0.7488 0.1715 0.7042 0.9300 0.2679 1.5501
##
## lowest : 2.37, 2.41, 2.44, 2.6, 2.67 (2)
## highest: 6.07, 6.17, 6.3, 6.94, 8.17
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ RBC (patients)
##
## Summary:
## n pairs: 400, valid: 380 (95.0%), missings: 20 (5.0%)
##
##
## Pearson corr. : 0.039
## Spearman corr.: 0.022
## Kendall corr. : 0.012
Hematocrit is the percentage of blood volume made up of red blood cells. The data is fairly symmetric in distribution and doesn’t appear to have any notable outliers. Like the red blood cell count, hematocrit on its own isn’t correlated with stroke severity but was included since it helps describe the patient’s blood composition.
## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Hematocrit"
## ------------------------------------------------------------------------------
## patients$HCT (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 392 8 187 0 39.22 38.62
## 98.0% 2.0% 0.0% 39.81
##
## .05 .10 .25 median .75 .90 .95
## 28.60 31.00 35.80 39.45 43.40 46.59 48.20
##
## range sd vcoef mad IQR skew kurt
## 37.70 5.99 0.15 5.86 7.60 -0.36 0.05
##
## lowest : 19.0, 21.1, 23.0, 23.9, 24.0
## highest: 51.1, 51.3, 52.4, 52.7, 56.7
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ HCT (patients)
##
## Summary:
## n pairs: 400, valid: 388 (97.0%), missings: 12 (3.0%)
##
##
## Pearson corr. : 0.034
## Spearman corr.: 0.026
## Kendall corr. : 0.017
Each patient’s hemoglobin level was measured in g/dL. The data is fairly symmetric in distribution and doesn’t appear to have any notable outliers. Hemoglobin on its own isn’t correlated with stroke severity but was included since it helps describe the patient’s blood composition.
## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Hemoglobin"
## ------------------------------------------------------------------------------
## patients$HBG (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 392 8 98 0 12.88 12.65
## 98.0% 2.0% 0.0% 13.10
##
## .05 .10 .25 median .75 .90 .95
## 8.80 9.80 11.60 13.10 14.40 15.50 16.20
##
## range sd vcoef mad IQR skew kurt
## 15.28 2.24 0.17 1.93 2.80 -0.56 0.57
##
## lowest : 3.72, 5.7, 6.4, 7.0, 7.1
## highest: 17.0, 17.1 (2), 17.2, 17.6 (2), 19.0
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ HBG (patients)
##
## Summary:
## n pairs: 400, valid: 388 (97.0%), missings: 12 (3.0%)
##
##
## Pearson corr. : 0.011
## Spearman corr.: 0.007
## Kendall corr. : 0.003
CLCLTAR is the clot area. The units were not provided but were presumably entered in square millimeters. The data is positively skewed and appears to have some outliers. Additionally, 42 patients did not have a value for the clot area. The clot area was included since it has a slight positive correlation with stroke severity.
## [1] "Category: Histopathology Results of Thrombus Retrieval"
## [1] "Description: Clot area"
## ------------------------------------------------------------------------------
## patients$CLCLTAR (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 358 42 121 0 186.17 144.05
## 89.5% 10.5% 0.0% 228.28
##
## .05 .10 .25 median .75 .90 .95
## 8.55 20.00 40.00 84.00 150.00 350.30 640.00
##
## range sd vcoef mad IQR skew kurt
## 4'499.00 405.22 2.18 76.35 110.00 6.32 51.07
##
## lowest : 1.0, 2.0 (4), 2.5, 3.0 (2), 4.0 (4)
## highest: 1'800.0 (2), 1'848.0, 2'250.0, 3'500.0, 4'500.0
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ CLCLTAR (patients)
##
## Summary:
## n pairs: 400, valid: 354 (88.5%), missings: 46 (11.5%)
##
##
## Pearson corr. : 0.121
## Spearman corr.: 0.199
## Kendall corr. : 0.133
CLCLTWT is the clot weight. The units were not provided. The data is positively skewed and appears to have some outliers. As with CLCLTAR, the clot weight was not available for 42 of the patients. The clot weight was included since it has a slight positive correlation with stroke severity.
## [1] "Category: Histopathology Results of Thrombus Retrieval"
## [1] "Description: Clot weight"
## ------------------------------------------------------------------------------
## patients$CLCLTWT (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 358 42 158 0 104.99 82.72
## 89.5% 10.5% 0.0% 127.27
##
## .05 .10 .25 median .75 .90 .95
## 5.00 9.00 20.25 46.00 88.75 201.00 366.50
##
## range sd vcoef mad IQR skew kurt
## 1'562.50 214.32 2.04 43.00 68.50 4.62 23.56
##
## lowest : 0.5, 1.0 (3), 2.0 (3), 3.0 (2), 4.0 (8)
## highest: 1'350.0, 1'355.0, 1'430.0, 1'485.0, 1'563.0
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ CLCLTWT (patients)
##
## Summary:
## n pairs: 400, valid: 354 (88.5%), missings: 46 (11.5%)
##
##
## Pearson corr. : 0.099
## Spearman corr.: 0.181
## Kendall corr. : 0.122
The Age variable, measured in years, is symmetrically distributed without any outliers. Age, on its own, doesn’t appear to have a notable correlation with stroke severity but was included since it helps describe the patient’s demographics.
## [1] "Category: Demographics"
## [1] "Description: Age (years)"
## ------------------------------------------------------------------------------
## patients$AGE (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 400 0 67 0 68.92 67.46
## 100.0% 0.0% 0.0% 70.37
##
## .05 .10 .25 median .75 .90 .95
## 43.00 48.90 58.00 70.00 80.00 87.00 90.00
##
## range sd vcoef mad IQR skew kurt
## 71.00 14.80 0.21 16.31 22.00 -0.40 -0.40
##
## lowest : 27.0, 28.0 (2), 29.0, 30.0, 32.0
## highest: 94.0 (2), 95.0, 96.0, 97.0 (2), 98.0
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ AGE (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%)
##
##
## Pearson corr. : 0.096
## Spearman corr.: 0.099
## Kendall corr. : 0.069
Each patient’s height was measured in centimeters and the values are fairly symmetric in distribution. Two of the patients had recorded heights below 95 centimeters, which seemed unlikely and caused issues with BMI. As a result, the data associated with these patients was dropped from the analysis. Height does not have a notable correlation with stroke severity but was included since it helps describe the patient’s physical attributes.
## [1] "Category: Demographics"
## [1] "Description: Height"
## ------------------------------------------------------------------------------
## patients$HEIGHT (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 382 18 69 0 169.72 168.51
## 95.5% 4.5% 0.0% 170.93
##
## .05 .10 .25 median .75 .90 .95
## 154.90 157.50 162.60 169.00 177.95 182.90 187.00
##
## range sd vcoef mad IQR skew kurt
## 130.00 12.00 0.07 11.86 15.35 -1.94 14.95
##
## lowest : 70.0, 93.98, 142.2, 147.3, 149.9 (2)
## highest: 193.0 (2), 195.0, 195.6, 198.1, 200.0
##
## heap(?): remarkable frequency (7.1%) for the mode(s) (= 160)
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ HEIGHT (patients)
##
## Summary:
## n pairs: 400, valid: 378 (94.5%), missings: 22 (5.5%)
##
##
## Pearson corr. : -0.063
## Spearman corr.: -0.037
## Kendall corr. : -0.026
Each patient’s weight was measured in kilograms and the values are slightly positively skewed. The patient’s weight does not have a notable correlation with the stroke severity but was included since it helps describe the patient’s physical attributes.
## [1] "Category: Demographics"
## [1] "Description: Weight"
## ------------------------------------------------------------------------------
## patients$WEIGHT (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 399 1 274 0 85.08 82.85
## 99.8% 0.2% 0.0% 87.31
##
## .05 .10 .25 median .75 .90 .95
## 55.18 59.00 69.00 81.90 97.75 116.12 130.00
##
## range sd vcoef mad IQR skew kurt
## 128.80 22.67 0.27 20.61 28.75 0.90 0.90
##
## lowest : 41.2, 42.2, 46.7, 47.7, 49.0
## highest: 150.0, 152.9, 154.0, 162.0, 170.0 (2)
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ WEIGHT (patients)
##
## Summary:
## n pairs: 400, valid: 395 (98.8%), missings: 5 (1.2%)
##
##
## Pearson corr. : -0.038
## Spearman corr.: -0.039
## Kendall corr. : -0.027
BMI is the patient’s Body Mass Index in kg/m^2. The values are positively skewed and contain 2 outliers as a result of the patients with recorded heights below 95 cm. BMI does not have a notable correlation with stroke severity but was included since it helps describe the patient’s physical attributes.
## [1] "Category: Demographics"
## [1] "Description: Body Mass Index"
## ------------------------------------------------------------------------------
## patients$BMI (numeric)
##
## length n NAs unique 0s mean meanCI'
## 400 382 18 340 0 29.8525 28.7551
## 95.5% 4.5% 0.0% 30.9499
##
## .05 .10 .25 median .75 .90 .95
## 19.5380 21.5130 24.3000 28.0450 33.1175 39.2330 44.0970
##
## range sd vcoef mad IQR skew kurt
## 167.5800 10.9084 0.3654 6.2417 8.8175 7.7457 101.9419
##
## lowest : 16.07, 16.61, 16.83, 16.9, 17.5
## highest: 55.09, 60.95, 61.26, 73.07, 183.65
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ BMI (patients)
##
## Summary:
## n pairs: 400, valid: 378 (94.5%), missings: 22 (5.5%)
##
##
## Pearson corr. : 0.041
## Spearman corr.: -0.035
## Kendall corr. : -0.026
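Because BMI divides weight by the square of height, an implausibly small recorded height inflates BMI sharply, which is why the height outliers reappear here. The sketch below uses three hypothetical patients and the 95 cm cutoff mentioned in the text; the helper name `bmi` and the data frame are illustrative only.

```r
# BMI in kg/m^2 from weight in kilograms and height in centimeters.
bmi <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Hypothetical patients; the second has an implausible recorded height.
toy <- data.frame(
  weight_kg = c(82, 90, 75),
  height_cm = c(169, 70, 172)
)
toy$bmi <- bmi(toy$weight_kg, toy$height_cm)

# Flag heights below the 95 cm cutoff and drop those rows.
toy$implausible_height <- toy$height_cm < 95
cleaned <- toy[!toy$implausible_height, ]

print(toy)
print(cleaned)
```

Screening on height before computing or using BMI avoids carrying the same data-entry problem into two predictors.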
The SEX variable indicates whether a patient is male (1) or female (2). While there isn’t a noticeable difference in mean stroke severity (14.8 vs. 15.4), the box plots show a difference in spread (IQR of 9 vs. 12), and sex was selected as a candidate variable as a result.
## [1] "Category: Demographics"
## [1] "Description: Sex"
## ------------------------------------------------------------------------------
## patients$SEX (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 2 206 51.5% 46.6% 56.4%
## 1 194 48.5% 43.6% 53.4%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ SEX (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## 1 2
## mean 14.848 15.395
## median 15.000 15.000
## sd 7.227 8.092
## IQR 9.000 12.000
## n 191 205
## np 48.232% 51.768%
## NAs 3 1
## 0s 1 1
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 0.25255, df = 1, p-value = 0.6153
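The group comparison reported for each categorical predictor can be sketched with base R's `kruskal.test()`. The data below is synthetic, with group sizes borrowed from the table above and a larger spread built into the second group. The Fligner-Killeen test at the end is an addition that does not appear in the source output; it is included only as one possible way to check a difference in spread rather than location.

```r
# Synthetic severity scores for two groups with similar centers, different spread.
set.seed(3)
toy <- data.frame(
  SEX = factor(rep(c("1", "2"), times = c(191, 205))),
  NIHSSTOT = c(round(rnorm(191, mean = 15, sd = 7)),
               round(rnorm(205, mean = 15, sd = 8)))
)
toy$NIHSSTOT <- pmin(42, pmax(0, toy$NIHSSTOT))  # clamp to the 0-42 scale

# Per-group mean, standard deviation, and IQR (rows = statistic, cols = group).
spread <- sapply(split(toy$NIHSSTOT, toy$SEX),
                 function(v) c(mean = mean(v), sd = sd(v), IQR = IQR(v)))
print(round(spread, 2))

# Kruskal-Wallis rank sum test for a shift in location between groups.
kw <- kruskal.test(NIHSSTOT ~ SEX, data = toy)
print(kw)

# Fligner-Killeen test for equal variances (not part of the source analysis).
fk <- fligner.test(NIHSSTOT ~ SEX, data = toy)
print(fk)
```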
The MHNONE variable is ‘Y’ if the patient has no pertinent medical history. Based on the boxplots, patients who indicated no pertinent medical history appear to have less severe strokes, so the factor was selected as a candidate.
## [1] "Category: Medical History"
## [1] "Description: No Pertinent Medical History"
## ------------------------------------------------------------------------------
## patients$MHNONE (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## UNK 371 92.8% 89.8% 94.9%
## Y 29 7.2% 5.1% 10.2%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHNONE (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 15.305 12.931
## median 15.000 13.000
## sd 7.701 7.201
## IQR 11.500 10.000
## n 367 29
## np 92.677% 7.323%
## NAs 4 0
## 0s 2 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 2.6381, df = 1, p-value = 0.1043
The MHPSIS variable indicates whether a patient suffered a previous ischemic stroke. Patients with a prior ischemic stroke appear to experience more severe strokes, making this a good candidate predictor.
## [1] "Category: Medical History"
## [1] "Description: Previous Ischemic Stroke"
## ------------------------------------------------------------------------------
## patients$MHPSIS (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## UNK 344 86.0% 82.3% 89.1%
## Y 56 14.0% 10.9% 17.7%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHPSIS (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 14.713 17.778
## median 14.500 19.000
## sd 7.680 7.218
## IQR 11.000 11.500
## n 342 54
## np 86.364% 13.636%
## NAs 2 2
## 0s 2 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 8.5409, df = 1, p-value = 0.003473
The MHPSTIA variable indicates if a patient had a previous transient ischemic attack. Only 10 patients indicated that they had a previous transient ischemic attack and their median stroke severity appears (based on the boxplots) to be slightly less than those who didn’t experience a previous transient ischemic attack. This was selected as a candidate predictor due to its relation to ischemic strokes.
## [1] "Category: Medical History"
## [1] "Description: Previous Transient Ischemic Attack"
## ------------------------------------------------------------------------------
## patients$MHPSTIA (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## UNK 390 97.5% 95.5% 98.6%
## Y 10 2.5% 1.4% 4.5%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHPSTIA (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 15.153 14.300
## median 15.000 11.500
## sd 7.698 7.379
## IQR 11.000 12.250
## n 386 10
## np 97.475% 2.525%
## NAs 4 0
## 0s 2 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 0.12551, df = 1, p-value = 0.7231
MHDVT indicates if a patient has suffered from deep vein thrombosis. An interesting feature of the boxplots is that the data for patients who suffered from deep vein thrombosis appears to be positively skewed. This may suggest that these patients are less likely to experience mild to moderate strokes.
## [1] "Category: Medical History"
## [1] "Description: Deep Vein Thrombosis"
## ------------------------------------------------------------------------------
## patients$MHDVT (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## UNK 382 95.5% 93.0% 97.1%
## Y 18 4.5% 2.9% 7.0%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHDVT (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 15.103 15.722
## median 15.000 15.000
## sd 7.695 7.607
## IQR 11.000 6.750
## n 378 18
## np 95.455% 4.545%
## NAs 4 0
## 0s 1 1
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 0.16318, df = 1, p-value = 0.6862
The MHDM variable indicates whether a patient has diabetes. Patients with diabetes appear to suffer from more severe strokes on average. This aligns with the GLUC numeric predictor.
## [1] "Category: Medical History"
## [1] "Description: Diabetes Mellitus"
## ------------------------------------------------------------------------------
## patients$MHDM (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## UNK 296 74.0% 69.5% 78.1%
## Y 104 26.0% 21.9% 30.5%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHDM (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 14.646 16.529
## median 15.000 16.500
## sd 7.894 6.883
## IQR 11.000 10.750
## n 294 102
## np 74.242% 25.758%
## NAs 2 2
## 0s 1 1
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 5.5448, df = 1, p-value = 0.01854
The MHHTN variable indicates whether a patient has hypertension (high blood pressure). Patients with hypertension appear to have more severe strokes on average.
## [1] "Category: Medical History"
## [1] "Description: Hypertension"
## ------------------------------------------------------------------------------
## patients$MHHTN (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## Y 290 72.5% 67.9% 76.6%
## UNK 110 27.5% 23.4% 32.1%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHHTN (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 13.636 15.706
## median 14.000 15.000
## sd 7.566 7.662
## IQR 10.000 11.000
## n 110 286
## np 27.778% 72.222%
## NAs 0 4
## 0s 0 2
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 5.4604, df = 1, p-value = 0.01945
The MHTHROMB variable captures whether a patient has a medical history of thrombocytopenia, which occurs when platelet counts are low. There seems to be a difference in severity between patients who have a history of thrombocytopenia and those who don’t, but most of the values are missing (93% are recorded as UNK).
## [1] "Category: Medical History"
## [1] "Description: Thrombocytopenia"
## ------------------------------------------------------------------------------
## patients$MHTHROMB (character)
##
## length n NAs unique levels dupes
## 400 400 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 UNK 372 93.0% 372 93.0%
## 2 N 19 4.8% 391 97.8%
## 3 Y 9 2.2% 400 100.0%
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHTHROMB (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 3
##
##
## N UNK Y
## mean 16.316 15.076 14.889
## median 15.000 15.000 12.000
## sd 6.750 7.724 8.403
## IQR 8.000 11.000 6.000
## n 19 368 9
## np 4.798% 92.929% 2.273%
## NAs 0 4 0
## 0s 1 1 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 0.92621, df = 2, p-value = 0.6293
The MHATEXCR variable captures whether a patient has a medical history of extracranial carotid atherosclerosis, which is a hardening and narrowing of vessels due to fat deposits. There seems to be a difference in severity between patients who have a history of extracranial carotid atherosclerosis and those who don’t, but most of the values are missing (91.8% are recorded as UNK).
## [1] "Category: Medical History"
## [1] "Description: Extracranial - Carotid Atherosclerosis"
## ------------------------------------------------------------------------------
## patients$MHATEXCR (character)
##
## length n NAs unique levels dupes
## 400 400 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 UNK 367 91.8% 367 91.8%
## 2 N 20 5.0% 387 96.8%
## 3 Y 13 3.2% 400 100.0%
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHATEXCR (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 3
##
##
## N UNK Y
## mean 16.579 15.027 15.923
## median 16.000 15.000 18.000
## sd 7.827 7.742 5.766
## IQR 11.000 11.000 6.000
## n 19 364 13
## np 4.798% 91.919% 3.283%
## NAs 1 3 0
## 0s 0 2 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 0.8365, df = 2, p-value = 0.6582
The MHPSISEL variable captures the type of previous ischemic stroke (when applicable). While most values are missing, it does appear that patients who previously had a cardioembolic (CARDIO) or cryptogenic (CRYPTOGEN) ischemic stroke experience more severe strokes than patients who didn’t have a previous ischemic stroke or who had a large artery atherosclerosis (LAA) or small artery occlusion (SAO) stroke.
## [1] "Category: Medical History"
## [1] "Description: Type of Previous Ischemic Stroke"
## ------------------------------------------------------------------------------
## patients$MHPSISEL (character)
##
## length n NAs unique levels dupes
## 400 400 0 5 5 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 UNK 345 86.2% 345 86.2%
## 2 CRYPTOGEN 29 7.2% 374 93.5%
## 3 CARDIO 15 3.8% 389 97.2%
## 4 SAO 8 2.0% 397 99.2%
## 5 LAA 3 0.8% 400 100.0%
## ------------------------------------------------------------------------------
## NIHSSTOT ~ MHPSISEL (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 5
##
##
## CARDIO CRYPTOGEN LAA SAO UNK
## mean 18.000 18.179 13.667 16.000 14.752
## median 20.000 18.500 13.000 15.000 15.000
## sd 6.370 7.414 6.028 8.699 7.702
## IQR 10.500 9.750 6.000 11.500 11.000
## n 15 28 3 7 343
## np 3.788% 7.071% 0.758% 1.768% 86.616%
## NAs 0 1 0 1 2
## 0s 0 0 0 0 2
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 9.0507, df = 4, p-value = 0.05984
The SHMRJYN variable indicates if a patient uses marijuana. The data for users appears to be positively skewed which may be an indication that users are less likely to experience mild to moderate strokes.
## [1] "Category: Medical History"
## [1] "Description: Marijuana Use"
## ------------------------------------------------------------------------------
## patients$SHMRJYN (character - dichotomous)
##
## length n NAs unique
## 400 400 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## UNK 378 94.5% 91.8% 96.3%
## Y 22 5.5% 3.7% 8.2%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## NIHSSTOT ~ SHMRJYN (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
##
##
## UNK Y
## mean 15.029 16.864
## median 15.000 15.500
## sd 7.740 6.527
## IQR 11.000 6.500
## n 374 22
## np 94.444% 5.556%
## NAs 4 0
## 0s 2 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 1.4397, df = 1, p-value = 0.2302
The SHALCUSE variable captures how many drinks a patient estimates having per week. Although the counts are relatively low, stroke severity appears to be lower for patients who report drinking weekly, particularly those reporting two to five drinks, than it is for those who don’t.
## [1] "Category: Medical History"
## [1] "Description: Frequency of Alcohol Use"
##
## 1DRINK 2DRINK 3TO5DRINK GTE6DRINK UNK
## 24 6 20 27 323
## ------------------------------------------------------------------------------
## patients$SHALCUSE (character)
##
## length n NAs unique levels dupes
## 400 400 0 5 5 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 UNK 323 80.8% 323 80.8%
## 2 GTE6DRINK 27 6.8% 350 87.5%
## 3 1DRINK 24 6.0% 374 93.5%
## 4 3TO5DRINK 20 5.0% 394 98.5%
## 5 2DRINK 6 1.5% 400 100.0%
## ------------------------------------------------------------------------------
## NIHSSTOT ~ SHALCUSE (patients)
##
## Summary:
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 5
##
##
## 1DRINK 2DRINK 3TO5DRINK GTE6DRINK UNK
## mean 15.250 10.500 11.550 14.630 15.476
## median 15.500 8.500 11.000 14.000 15.000
## sd 5.944 8.620 7.409 7.692 7.753
## IQR 7.500 8.750 8.500 7.500 12.000
## n 24 6 20 27 319
## np 6.061% 1.515% 5.051% 6.818% 80.556%
## NAs 0 0 0 0 4
## 0s 0 0 0 0 2
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 7.9449, df = 4, p-value = 0.09362
Linear regression can be used to create an explanatory model that helps us understand which of the candidate factors selected in the EDA process contribute to higher stroke severity scores in patients. As noted in the EDA section, some of the candidate predictors are highly skewed, have outliers, and contain missing values, all of which can pose challenges for linear regression. These issues were addressed via data selection, imputation, and data transformations.
The hospital data contains data for patients suffering from two distinct stroke types, ischemic and hemorrhagic. There were only 10 patients who both suffered from a hemorrhagic stroke and had an NIHSSTOT score. To focus the study, we chose to examine ischemic stroke patients exclusively. Additionally, 2 of the patients had recorded heights below 95 cm and were dropped from the study. This selection reduced the initial data set from 400 to 386 patients.
# Build the linear regression (LR) data set
library(dplyr)

LR.data <- patients %>%
  # Keep ischemic (ISC) patients with a recorded NIHSSTOT and drop the two
  # excluded subjects (per the text, the patients with heights below 95 cm)
  filter(!is.na(NIHSSTOT) & IEESTRTY == "ISC" &
           !(SubjectID %in% c("00272-014", "00122-001"))) %>%
  dplyr::select(SubjectID, NIHSSTOT  # ID and response variable
    , GLUC, WBC, RBC, HCT, HBG       # Lab metrics from blood
    , CLCLTAR, CLCLTWT               # Clot metrics
    , AGE, HEIGHT, WEIGHT, BMI       # Numeric demographics
    , MHNONE, MHPSIS, MHPSTIA, MHDVT, MHDM, MHHTN, SEX, SHMRJYN, SHALCYN  # Binary
    , MHTHROMB, MHATEXCR             # Three-level categorical
    , SHALCUSE, MHPSISEL             # Five-level categorical
  )
Once the data of interest was selected, a fair number of missing values remained. For numeric variables, missing values were replaced with the median value. Missing values for categorical variables were encoded as ‘UNK’ for unknown.
#LR.data$NIHISTOT = as.numeric(LR.data$NIHISTOT)
#LR.data$NPASS = as.numeric(LR.data$NPASS )
# For numeric data, impute the median into missing values
# (columns 3 to 13 hold the 11 numeric variables selected above)
for (i in 1:11) {
  LR.data[is.na(LR.data[, (i + 2)]), (i + 2)] = median(LR.data[, (i + 2)][[1]], na.rm = TRUE)
}
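The imputation rules described above (median for numeric variables, ‘UNK’ for categorical variables) can be sketched on a toy data frame. This is a minimal illustration rather than the code used in the study: the helper `impute_simple` and the toy values are hypothetical, and only the column names are borrowed from the report.

```r
# Toy data standing in for LR.data (values invented for illustration)
toy <- data.frame(
  GLUC = c(110, NA, 95, 240, 130),
  WBC  = c(7.1, 9.4, NA, 12.0, 8.2),
  MHDM = c("Y", NA, "N", "N", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: impute each column according to its type
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (is.numeric(df[[col]])) {
      # Numeric: replace NA with the median of the observed values
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      # Categorical: make missingness its own level
      df[[col]][missing] <- "UNK"
    }
  }
  df
}

toy_imputed <- impute_simple(toy)
```

Selecting columns by type rather than by position avoids relying on the numeric variables occupying a fixed block of columns.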
Power transformations were reviewed for each numeric variable that showed signs of skewness or outliers. The following transformations were applied to the data to address skewness and rein in outliers.
The log-likelihood curve from the Box-Cox analysis (shown below) was used to select an inverse transformation for the glucose variable.
The box plots below show the improvements realized by the transformation.
The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a log transformation for the white blood cell count variable.
The box plots below show the improvements realized by the transformation.
The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a log transformation for the clot area variable.
The box plots below show the improvements realized by the transformation.
The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a log transformation for the clot weight variable.
The box plots below show the improvements realized by the transformation.
The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a negative square root transformation for the BMI variable.
The box plots below show the improvements realized by the transformation.
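The Box-Cox step used for the transformations above can be sketched as follows. This is an illustration under stated assumptions, not the study code: the data are simulated, `wbc_sim` is a hypothetical stand-in for a skewed lab value, and the rule of snapping the estimate to the nearest conventional power is one common convention rather than a documented choice of this analysis.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated right-skewed stand-in for a lab value (not study data)
wbc_sim <- rlnorm(300, meanlog = 2, sdlog = 0.6)

# Profile log-likelihood over a grid of lambda values for an intercept-only model
bc <- boxcox(wbc_sim ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]

# Snap the estimate to the nearest conventional power
candidates <- c(inverse = -1, inv_sqrt = -0.5, log = 0, sqrt = 0.5, none = 1)
chosen <- names(candidates)[which.min(abs(candidates - best_lambda))]
```

Setting `plotit = TRUE` draws the log-likelihood curve with its confidence interval for lambda, which is the kind of plot referred to in this section.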
The SHALCUSE variable indicates how many drinks an individual has per week. This variable was converted to a numeric variable to reflect that as the value increases, so does the number of drinks consumed by the patient. This was the only candidate variable that needed to be refined using feature engineering.
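LR.data = LR.data %>%
  mutate(
    SHALC = case_when(
      SHALCUSE == "1DRINK" ~ 1
      , SHALCUSE == "2DRINK" ~ 2
      , SHALCUSE == "3TO5DRINK" ~ 4  # midpoint of the 3 to 5 range
      , SHALCUSE == "GTE6DRINK" ~ 7  # 6 or more drinks, coded as 7
      , TRUE ~ 0  # UNK (not reported) is coded as 0
    ))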
Variable selection was performed by considering all pairwise interactions between the numeric and categorical candidate features and selecting a subset that yielded strong model performance. This process compared the R squared, BIC, and RSS of models created using forward selection, backward selection, and sequential replacement. The resulting values are shown in the charts below, which were used to determine that a linear model with 12 variables would likely explain 15 to 20 percent of the variability in stroke severity without drastically increasing the model BIC. A model with 12 parameters created using sequential replacement was selected as the final linear regression model and further refined to increase interpretability.
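A rough sketch of this kind of comparison is given below using base R's `step()` on simulated data. It is not the procedure used in the study: it covers only forward and backward selection (sequential replacement is available through `regsubsets()` in the leaps package with `method = "seqrep"`), and the variables `x1`, `x2`, `x3`, and `g` are invented stand-ins for the candidate predictors.

```r
set.seed(42)
n <- 200
sim <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  g  = factor(sample(c("N", "Y"), n, replace = TRUE))
)
# Simulated response: an x1 main effect plus an x2-by-g interaction
sim$y <- 10 + 2 * sim$x1 + 1.5 * sim$x2 * (sim$g == "Y") + rnorm(n, sd = 2)

null_fit <- lm(y ~ 1, data = sim)
full_fit <- lm(y ~ (x1 + x2 + x3 + g)^2, data = sim)  # all pairwise interactions

# k = log(n) makes step() penalise by BIC instead of AIC
fwd <- step(null_fit, scope = formula(full_fit), direction = "forward",
            k = log(n), trace = 0)
bwd <- step(full_fit, direction = "backward", k = log(n), trace = 0)

# Side-by-side summary of the criteria compared in the text
compare <- data.frame(
  method    = c("forward", "backward"),
  n_terms   = c(length(coef(fwd)), length(coef(bwd))) - 1,
  r_squared = c(summary(fwd)$r.squared, summary(bwd)$r.squared),
  bic       = c(BIC(fwd), BIC(bwd)),
  rss       = c(deviance(fwd), deviance(bwd))
)
```

Comparing the rows of `compare` shows whether the two search directions settle on models of similar size and fit, which is the same kind of evidence the charts in this section provide.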
The parameters and diagnostic plots for the first iteration of the final model are shown below. The diagnostic plots indicate that the linear model fits the data reasonably well and that we can proceed with refining the model. We observe that the CLCLTAR.tran:MHHTN terms are not statistically significant. The first model refinement is to drop these from the model.
##
## Call:
## lm(formula = NIHSSTOT ~ AGE + SEX + GLUC.tran:SEX + WBC.tran:MHPSIS +
## WBC.tran:SHMRJYN + HCT:SEX + HBG:SEX + CLCLTAR.tran:MHHTN +
## CLCLTAR.tran:SEX + SHALC:MHPSIS + SHALC:MHDM + SHALC:MHHTN,
## data = LR.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5453 -4.9156 -0.2949 4.4144 24.6650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 179.72033 250.54557 0.717 0.47364
## AGE 0.07285 0.02674 2.725 0.00674 **
## SEX2 -925.04733 340.51178 -2.717 0.00691 **
## SEX1:GLUC.tran -180.35530 252.32706 -0.715 0.47521
## SEX2:GLUC.tran 746.54666 245.05111 3.046 0.00248 **
## WBC.tran:MHPSISUNK 0.84498 0.92400 0.914 0.36107
## WBC.tran:MHPSISY 2.48533 1.02215 2.431 0.01552 *
## WBC.tran:SHMRJYNY 1.92957 0.75260 2.564 0.01075 *
## SEX1:HCT 0.12184 0.22112 0.551 0.58195
## SEX2:HCT 0.88788 0.33881 2.621 0.00915 **
## SEX1:HBG 0.07540 0.56638 0.133 0.89417
## SEX2:HBG -2.48925 0.94942 -2.622 0.00911 **
## CLCLTAR.tran:MHHTNUNK -0.10655 0.49936 -0.213 0.83116
## CLCLTAR.tran:MHHTNY 0.34837 0.47584 0.732 0.46457
## SEX2:CLCLTAR.tran 2.11258 0.64821 3.259 0.00122 **
## MHPSISUNK:SHALC 0.17884 0.33872 0.528 0.59783
## MHPSISY:SHALC -2.25467 1.01463 -2.222 0.02689 *
## SHALC:MHDMY 1.47885 0.52240 2.831 0.00490 **
## MHHTNY:SHALC -1.11189 0.43269 -2.570 0.01057 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.975 on 365 degrees of freedom
## Multiple R-squared: 0.2083, Adjusted R-squared: 0.1692
## F-statistic: 5.334 on 18 and 365 DF, p-value: 4.856e-11
Dropping the CLCLTAR.tran:MHHTN terms from the model does not drastically impact the R squared value. Reviewing the model, we find that the statistically significant terms involving sex only indicate differences for females. The model can be simplified by coding SEX as an indicator variable for female. The same can be done for the MHPSIS variable to indicate whether the patient had a previous ischemic stroke. This was done to create the final model on the next tab.
##
## Call:
## lm(formula = NIHSSTOT ~ AGE + SEX + GLUC.tran:SEX + WBC.tran:MHPSIS +
## WBC.tran:SHMRJYN + HCT:SEX + HBG:SEX + CLCLTAR.tran:SEX +
## SHALC:MHPSIS + SHALC:MHDM + SHALC:MHHTN, data = LR.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0587 -4.7880 -0.2755 4.6702 25.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.26026 248.61862 0.367 0.71378
## AGE 0.08286 0.02649 3.128 0.00190 **
## SEX2 -899.37944 342.10571 -2.629 0.00893 **
## SEX1:GLUC.tran -91.47090 250.40151 -0.365 0.71510
## SEX2:GLUC.tran 809.61116 244.65999 3.309 0.00103 **
## WBC.tran:MHPSISUNK 0.74869 0.92784 0.807 0.42024
## WBC.tran:MHPSISY 2.49040 1.02753 2.424 0.01585 *
## WBC.tran:SHMRJYNY 1.87816 0.75620 2.484 0.01345 *
## SEX1:HCT 0.15149 0.22188 0.683 0.49518
## SEX2:HCT 0.82477 0.33938 2.430 0.01557 *
## SEX1:HBG -0.03871 0.56698 -0.068 0.94561
## SEX2:HBG -2.33112 0.95170 -2.449 0.01478 *
## SEX1:CLCLTAR.tran 0.24355 0.47596 0.512 0.60917
## SEX2:CLCLTAR.tran 2.37560 0.44121 5.384 1.3e-07 ***
## MHPSISUNK:SHALC -0.05462 0.32346 -0.169 0.86600
## MHPSISY:SHALC -2.53661 1.01185 -2.507 0.01261 *
## SHALC:MHDMY 1.45564 0.52504 2.772 0.00585 **
## SHALC:MHHTNY -0.76512 0.40525 -1.888 0.05981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.012 on 366 degrees of freedom
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1604
## F-statistic: 5.305 on 17 and 366 DF, p-value: 1.535e-10
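The indicator recoding described above can be sketched as follows. The toy rows are invented. The coding of SEX as 1/2 with 2 meaning female, and of MHPSIS as N/Y/UNK, is inferred from the coefficient names in the model output rather than from a data dictionary, so treat both as assumptions.

```r
# Toy rows; the level codings are assumptions inferred from the model output
toy <- data.frame(
  SEX    = factor(c(1, 2, 2, 1)),
  MHPSIS = c("N", "Y", "UNK", "Y")
)

# 1 when the patient is female (assumed to be SEX == 2), otherwise 0
toy$FEMALE <- as.integer(toy$SEX == "2")

# 1 only for a confirmed previous ischemic stroke; N and UNK both map to 0
toy$PREV_ISC <- as.integer(toy$MHPSIS == "Y")
```

With 0/1 indicators, an interaction such as `GLUC.tran:FEMALE` contributes a single slope that applies only to female patients, which is why the male-only rows disappear from the coefficient table.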
The final model explains roughly 16.5 percent of the variability in the stroke severity score among patients (adjusted R squared of 0.165; multiple R squared of 0.189). The amount of variance explained compared to the total variance is visually represented using histograms on the following tab. The model has revealed some interesting details regarding the explanatory variables. The final model's parameter estimates, along with their p-values and confidence intervals, are shown below. The diagnostic plots are also provided and indicate that the final model fits the data reasonably well. Model interpretations are provided in the following section.
##
## Call:
## lm(formula = NIHSSTOT ~ AGE + FEMALE + GLUC.tran:FEMALE + HCT:FEMALE +
## HBG:FEMALE + CLCLTAR.tran:FEMALE + WBC.tran:PREV_ISC + WBC.tran:SHMRJYN +
## SHALC:PREV_ISC + SHALC:MHDM + SHALC:MHHTN, data = LR.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0359 -5.0470 -0.1371 4.5406 25.4849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.32666 1.84037 5.068 6.35e-07 ***
## AGE 0.07457 0.02574 2.897 0.003987 **
## FEMALE -865.40136 234.74403 -3.687 0.000261 ***
## FEMALE:GLUC.tran 860.03318 237.18279 3.626 0.000328 ***
## FEMALE:HCT 0.82557 0.33831 2.440 0.015141 *
## FEMALE:HBG -2.31997 0.94836 -2.446 0.014896 *
## FEMALE:CLCLTAR.tran 2.36233 0.43941 5.376 1.35e-07 ***
## WBC.tran:PREV_ISC 1.79133 0.49050 3.652 0.000297 ***
## WBC.tran:SHMRJYNY 1.87182 0.74565 2.510 0.012486 *
## PREV_ISC:SHALC -2.45977 0.96284 -2.555 0.011025 *
## SHALC:MHDMY 1.41145 0.50649 2.787 0.005597 **
## SHALC:MHHTNY -0.76885 0.26910 -2.857 0.004516 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.993 on 372 degrees of freedom
## Multiple R-squared: 0.189, Adjusted R-squared: 0.165
## F-statistic: 7.881 on 11 and 372 DF, p-value: 2.457e-12
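As a quick check on the figures in the summary above, the adjusted R squared can be recomputed from the multiple R squared and the degrees of freedom. The sample size of 384 is derived here from the 372 residual degrees of freedom plus the 12 estimated coefficients.

```r
r2 <- 0.189  # Multiple R-squared from the summary above
n  <- 384    # 372 residual df + 12 estimated coefficients
p  <- 11     # predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)  # 0.165
```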
The histogram below shows the distribution of the stroke severity score among patients. The wide spread illustrates the variance in the score.
The histogram below shows the distribution of the stroke severity scores predicted by the final model. The narrow spread illustrates the variance in predicted values. When we compare it against the first histogram, we observe that the model predictions are closer to normally distributed and have less variation. This depicts the amount of variation in stroke severity that the final model accounts for.
The intercept in this model can be interpreted as the expected stroke severity score for male patients that:
Age has a positive correlation with stroke severity score but it is more statistically significant than practically significant. The coefficient indicates that for each 13 year increase in age, the stroke severity is expected to increase by 1 point.
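The 13 year figure follows directly from the AGE coefficient in the final model, as this short calculation shows.

```r
beta_age <- 0.07457  # AGE estimate from the final model

years_per_point <- 1 / beta_age   # years of age per 1 point of NIHSSTOT
decade_effect   <- 10 * beta_age  # expected change in NIHSSTOT per 10 years

round(years_per_point, 2)  # 13.41
round(decade_effect, 2)    # 0.75
```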
If the patient is female, her expected stroke severity score depends on her glucose level, the size of the clot, her hematocrit (percent of red blood cells), and her hemoglobin level. The charts below show how a female patient’s stroke severity score is expected to be impacted by these levels. The gray and black lines indicate the 1st quartile, median, and 3rd quartile for the glucose, clot area, hematocrit, and hemoglobin among females, and the range on the X axis covers the minimum and maximum values measured in the study.
White blood cell count is positively correlated with stroke severity for patients that have had a prior ischemic stroke and/or have a history of using marijuana. The chart below shows how the stroke severity is impacted by white blood cell count. The green line indicates the trend for patients who have had a previous ischemic stroke but have not used marijuana previously. The blue line shows the trend for patients who have used marijuana previously but have not had a prior ischemic stroke. The red line shows the trend for patients who have both had a previous ischemic stroke and have used marijuana in the past. The gray and black lines show the 1st quartile, the median, and the 3rd quartile.
Alcohol consumption was another factor that influenced a patient’s stroke severity, but its influence depended on whether the patient had a previous ischemic stroke, was diabetic, or had hypertension. The trends for each of these and the possible combinations are shown in the graph below. Alcohol consumption is generally associated with a decrease in stroke severity for patients without diabetes. This finding should be interpreted with caution since the majority of patients didn’t report how much alcohol they consumed. For those patients, alcohol consumption was coded as 0 and, as the chart indicates, no adjustment was made to their score.
The proteomics data set was evaluated using a differential expression analysis to search for biomarkers. This was done by regressing the NIHSSTOT response variable onto each protein using a linear model and reviewing the resulting p-values to determine if the protein was statistically significant. Two inherent challenges in this process were handling skewness in the protein data and accounting for false discovery among the large number of models created. These challenges were addressed using the following approaches:
As is sometimes the case, none of the q-values associated with the proteins were statistically significant. The search was expanded by adding the candidate predictors from the hospital data as covariates in the linear protein models. Additionally, a new variable called STROKE_BELT was introduced to see if any biomarkers existed when the model accounted for regions in the US that have higher concentrations of strokes. This wider sweep of the data identified 1 biomarker, which was present in the model that incorporated the individual’s sex.
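The per-protein modelling loop described above can be sketched on simulated data. This is an illustration, not the study pipeline: the helper `protein_pvalue`, the simulated columns `P_1` to `P_5`, and the single optional covariate are all hypothetical, and no skewness correction is applied here.

```r
set.seed(7)
n <- 60
# Simulated stand-ins: a severity score, a sex indicator, and 5 protein columns
sev    <- round(runif(n, 0, 30))
female <- rbinom(n, 1, 0.5)
prot   <- matrix(rexp(n * 5), nrow = n,
                 dimnames = list(NULL, paste0("P_", 1:5)))
prot[sample(length(prot), 40)] <- 0  # zeros mimic undetected proteins

# Hypothetical helper: p-value for one protein, optionally adjusting for a covariate
protein_pvalue <- function(x, response, covariate = NULL) {
  x[x == 0] <- NA  # treat zeros as missing values
  d <- data.frame(response = response, x = x)
  if (!is.null(covariate)) d$covariate <- covariate
  fit <- lm(response ~ ., data = d)  # lm() drops incomplete rows by default
  summary(fit)$coefficients["x", "Pr(>|t|)"]
}

p_base <- apply(prot, 2, protein_pvalue, response = sev)
p_sex  <- apply(prot, 2, protein_pvalue, response = sev, covariate = female)
```

Each call returns one p-value per protein, so `p_base` and `p_sex` are the vectors that would then be passed to a false discovery rate procedure.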
The p-value plots below show the histograms of the p-values for the base models, the models that incorporate the patient’s sex, and the models that incorporate the stroke belt indicator. If there are no significant proteins, then we expect the p-values to be uniformly distributed. The chart below each histogram shows the proportion of truly null values as a function of the tuning parameter \(\lambda\). The cubic spline fit to the \(\pi(\lambda)\) vs \(\lambda\) is used to estimate the proportion of null values at \(\lambda =1\) for determining the q-values.
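The q-value calculation outlined above can be sketched in base R. This is a simplified version of the smoother-based estimate, written under assumptions: the function name `storey_qvalues` is hypothetical, the \(\lambda\) grid of 0.05 to 0.95 is a conventional choice, the spline is read off at the largest grid value as a stand-in for \(\lambda = 1\), and the p-values are simulated.

```r
# Hypothetical helper: estimate the null proportion, then scale BH-adjusted p-values
storey_qvalues <- function(p, lambda = seq(0.05, 0.95, by = 0.05)) {
  # pi(lambda): share of p-values above lambda, rescaled by the interval width
  pi0_lambda <- sapply(lambda, function(l) mean(p > l) / (1 - l))
  # Smooth pi(lambda) with a cubic smoothing spline and read it off at the largest lambda
  fit <- smooth.spline(lambda, pi0_lambda, df = 3)
  pi0 <- predict(fit, x = max(lambda))$y
  pi0 <- min(max(pi0, 0), 1)  # keep the estimate inside [0, 1]
  # q-values are the BH-adjusted p-values scaled by the estimated null proportion
  list(pi0 = pi0, q = pi0 * p.adjust(p, method = "BH"))
}

set.seed(3)
p_sim <- c(runif(900), rbeta(100, 0.5, 20))  # 90% null, 10% signal (simulated)
res <- storey_qvalues(p_sim)
```

When every p-value comes from a true null, the estimated null proportion is close to 1 and the q-values are essentially the Benjamini-Hochberg adjusted p-values.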
## Protein beta se p_value Q_VALUE Protein_IDs
## 1 P_351 24.49994 2.293611 3.971638e-05 0.03447381 P06493;Q07785;P61075
## Majority_protein_IDs
## 1 P06493
The only proteins with statistically significant q-values were those associated with protein batch 351 when the explanatory model includes sex as a covariate:
The plot below shows what the resulting model looks like. When stroke severity is regressed onto protein \(P_{351}\), there is a statistically significant relationship if sex is included as a covariate. Stroke severity increases at the same rate for females and males, but there is an almost 10 point vertical shift between the lines, indicating that females have higher stroke severity if everything else is held constant. Both sexes experience higher stroke severity as the quantity of protein \(P_{351}\) increases. Unfortunately, the data set is rather small and only 9 observations are used in the model, meaning the addition of a single point that doesn’t fit the displayed trend could completely change the results. The small sample is a consequence of treating 0s as missing values in the protein data.
##
## Call:
## lm(formula = response ~ predictor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2039 -0.6026 0.2002 1.0103 1.7581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -457.062 43.827 -10.429 4.56e-05 ***
## predictorFEMALE 9.868 1.262 7.821 0.000231 ***
## predictorP_351 24.500 2.294 10.682 3.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.68 on 6 degrees of freedom
## (65 observations deleted due to missingness)
## Multiple R-squared: 0.9544, Adjusted R-squared: 0.9392
## F-statistic: 62.74 on 2 and 6 DF, p-value: 9.504e-05
Protein 351 represents 3 different proteins: P06493, Q07785, and P61075. The majority protein is P06493. Details of each are provided below.
The proteins identified by our model are all Cyclin-Dependent Kinases (CDKs), which have previously been linked to stroke cases. While our model has failed to detect previously unknown biomarkers with regard to stroke severity, it has produced some evidence to support previous findings. The interested reader can learn more from the following links:
The final data extracted as a part of this study were Single Nucleotide Polymorphisms (SNPs), which were collected for each of the 400 participants. SNPs represent positions in the human genetic code (DNA) where substantial variability occurs and are particularly useful for identifying disease-causing genes. These data were evaluated using an approach similar to that used for the proteomics data but had their own unique set of challenges:
The first set of models evaluated on the SNP data were linear regression models where the stroke severity (NIHSSTOT) was used as the response and the SNP was used as the predictor along with the hospital data as covariates. P-values for each model were extracted using ANOVA tests to determine the statistical relevance of a given SNP. Unfortunately, the histograms of the p-values of the resulting models are uniformly distributed, which is a key indication that none of the SNPs are truly statistically significant. This was confirmed by calculating the q-values, which resulted in no statistically significant SNPs.
LM1 = lm( formula = NIHSSTOT ~ SNP , data = SNP_MODEL_DATA)
SNP_MODEL_RESULTS_LM1 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM1.csv")
hist(SNP_MODEL_RESULTS_LM1$p_value)
LM2 = lm( formula = NIHSSTOT ~ SNP + FEMALE , data = SNP_MODEL_DATA)
SNP_MODEL_RESULTS_LM2 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM2.csv")
hist(SNP_MODEL_RESULTS_LM2$SNP_ANOVA_pvalue)
LM3 = lm( formula = NIHSSTOT ~ SNP + PREV_ISC , data = SNP_MODEL_DATA)
SNP_MODEL_RESULTS_LM3 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM3.csv")
hist(SNP_MODEL_RESULTS_LM3$SNP_ANOVA_pvalue)
LM4 = lm( formula = NIHSSTOT ~ SNP + SHMRJYN , data = SNP_MODEL_DATA)
SNP_MODEL_RESULTS_LM4 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM4.csv")
hist(SNP_MODEL_RESULTS_LM4$SNP_ANOVA_pvalue)
LM5 = lm( formula = NIHSSTOT ~ SNP + MHDMY , data = SNP_MODEL_DATA)
SNP_MODEL_RESULTS_LM5 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM5.csv")
hist(SNP_MODEL_RESULTS_LM5$SNP_ANOVA_pvalue)
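One way a per-SNP ANOVA p-value of this kind could be obtained is by comparing nested linear models, as sketched below on simulated data. The report does not show how its p-values were computed, so the nested comparison is an assumption. The data frame `snp_sim` and the coding of the genotype as a three-level factor are also invented for the example.

```r
set.seed(11)
n <- 150
# Simulated stand-in for one SNP's modelling data (not study data)
snp_sim <- data.frame(
  NIHSSTOT = round(runif(n, 0, 30)),
  SNP      = factor(sample(0:2, n, replace = TRUE)),  # genotype as allele count
  FEMALE   = rbinom(n, 1, 0.5)
)

reduced_lm <- lm(NIHSSTOT ~ FEMALE, data = snp_sim)
full_lm    <- lm(NIHSSTOT ~ FEMALE + SNP, data = snp_sim)

# F test for the SNP term after accounting for the covariate
snp_p_lm <- anova(reduced_lm, full_lm)[["Pr(>F)"]][2]
```

Repeating this for every SNP yields the vector of p-values whose histogram is inspected for departures from a uniform distribution.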
To widen the SNP analysis, a new response was selected. Some regions of the US are prone to higher stroke rates and are said to reside within the “stroke belt.” Each patient’s hospital location was known and the binary variable STROKE_BELT was created to indicate if the patient resided in the stroke belt. Logistic regression was used to model the STROKE_BELT variable against the SNP data with the hospital data as covariates. P-values for each logistic regression model were extracted using ANOVA tests to determine statistical relevance of a given SNP. The histograms of the p-values of the resulting models do appear to have more values concentrated around 0 (a good sign) but when the corresponding q-values were calculated, there were no statistically significant results.
LGR1 = glm( formula = STROKE_BELT ~ SNP , data = SNP_MODEL_DATA, family = binomial(link = "logit"))
SNP_MODEL_RESULTS_LGR1 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LGR1.csv")
hist(SNP_MODEL_RESULTS_LGR1$p_value)
min(SNP_MODEL_RESULTS_LGR1$q_value, na.rm = T)
## [1] 0.336149
LGR2 = glm( formula = STROKE_BELT ~ SNP + FEMALE , data = SNP_MODEL_DATA, family = binomial(link = "logit"))
SNP_MODEL_RESULTS_LGR2 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LGR2.csv")
hist(SNP_MODEL_RESULTS_LGR2$p_value)
min(SNP_MODEL_RESULTS_LGR2$q_value, na.rm = T)
## [1] 0.3584762
LGR3 = glm( formula = STROKE_BELT ~ SNP + PREV_ISC , data = SNP_MODEL_DATA, family = binomial(link = "logit"))
SNP_MODEL_RESULTS_LGR3 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LGR2.csv")
hist(SNP_MODEL_RESULTS_LGR3$p_value)
min(SNP_MODEL_RESULTS_LGR3$q_value, na.rm = T)
## [1] 0.3584762
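For the logistic models, the analogous per-SNP test is a likelihood ratio (analysis of deviance) comparison between nested fits. The sketch below uses simulated data, and treating the comparison this way is an assumption, since the report only states that ANOVA tests were used. The data frame `sb_sim` is hypothetical.

```r
set.seed(12)
n <- 200
# Simulated stand-in for one SNP's modelling data (not study data)
sb_sim <- data.frame(
  STROKE_BELT = rbinom(n, 1, 0.4),
  SNP         = factor(sample(0:2, n, replace = TRUE)),
  FEMALE      = rbinom(n, 1, 0.5)
)

reduced_glm <- glm(STROKE_BELT ~ FEMALE, data = sb_sim,
                   family = binomial(link = "logit"))
full_glm    <- glm(STROKE_BELT ~ FEMALE + SNP, data = sb_sim,
                   family = binomial(link = "logit"))

# Chi-squared test on the drop in deviance attributable to the SNP term
snp_p_glm <- anova(reduced_glm, full_glm, test = "Chisq")[["Pr(>Chi)"]][2]
```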
This analysis tackled 3 different sources of data with the broad objective of finding any statistically significant details related to stroke severity. It focused on creating explanatory models and used an evidence-focused approach. Evaluation of the hospital data resulted in a model that helped explain the difference in stroke severity between males and females. The model was able to explain approximately 16.5% of the variation in the entire data set. Analysis of the proteomics data again found statistically significant differences in the stroke severity experienced by male and female patients and found evidence to back previous medical studies that connected Cyclin-Dependent Kinases (CDKs) to ischemic strokes. Finally, this study evaluated SNP data in the hope of identifying genes that may cause strokes or increase stroke severity and found no evidence of either for the patients in the study.